Adaptive text mining: Inferring structure from sequences

Author: Witten Ian H.
Publication venue: 'Elsevier BV'
Publication date: 01/01/2004

Text mining is about inferring structure from sequences representing natural language text, and may be defined as the process of analyzing text to extract information that is useful for particular purposes. Although hand-crafted heuristics are a common practical approach for extracting information from text, a general, and generalizable, approach requires adaptive techniques. This paper studies the way in which the adaptive techniques used in text compression can be applied to text mining. It develops several examples: extraction of hierarchical phrase structures from text, identification of keyphrases in documents, locating proper names and quantities of interest in a piece of text, text categorization, word segmentation, acronym extraction, and structure recognition. We conclude that compression forms a sound unifying principle that allows many text mining problems to be tackled adaptively.
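
As a rough illustration of the compression idea applied to text categorization, the sketch below assigns a document to the category whose training text lets it be encoded most cheaply. It is a minimal sketch under stated assumptions, not the paper's method: zlib stands in for the adaptive compression models the abstract refers to, and the function names and the two-category training data are hypothetical, invented for this example.

```python
import zlib
from typing import Dict


def compressed_size(text: str) -> int:
    """Number of bytes zlib needs to encode the text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def categorize(document: str, training: Dict[str, str]) -> str:
    """Pick the category whose training text makes the document cheapest to encode."""

    def extra_bytes(label: str) -> int:
        base = training[label]
        # Cost of the document given the category, approximated as the growth
        # in compressed size when the document is appended to the training text.
        return compressed_size(base + " " + document) - compressed_size(base)

    return min(training, key=extra_bytes)


# Hypothetical training data, invented for illustration only.
TRAINING = {
    "sports": "The team won the match after the striker scored a late goal in the final minutes.",
    "cooking": "Simmer the sauce gently, then stir in the chopped onions and a pinch of salt.",
}

print(categorize("the striker scored a late goal", TRAINING))
```

A general-purpose compressor such as zlib is a crude stand-in on short texts; the sketch only shows the decision rule of choosing the model under which the document is most compressible.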

Research Commons@Waikato

Digital libraries for the developing world

Author: Witten Ian H.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2006

Digital libraries (DLs) are the killer app for information technology in developing countries. Priorities here include health, agriculture, nutrition, hygiene, sanitation, and safe drinking water. Computers are not a priority, but simple, reliable access to targeted information meeting these basic needs certainly is. DLs can assist human development by providing a non-commercial mechanism for distributing humanitarian information on topics such as health, agriculture, nutrition, hygiene, sanitation, and water supply. Many other areas, ranging from disaster relief to medical education, from the preservation and propagation of indigenous culture to educational material that addresses specific community problems, also benefit from new methods of information distribution.

Research Commons@Waikato

Customizing digital library interfaces with Greenstone

Author: Witten Ian H.
Publication venue: IEEE Computer Society
Publication date: 01/01/2003

Digital libraries are organized, focused collections of information. They are focused on a particular topic or theme—and good digital libraries will articulate the principles governing what is included. They are organized to make information accessible in particular, well-defined, ways—and good ones will include a description of how the information is organized (Lesk, 1997). The Greenstone digital library software is intended to help users construct simple collections of information very quickly. Indeed, only a few minutes of the user's time are needed to set up a collection based on a standard design and initiate the building process. Collections may be large—some comprise Gbytes of text; millions of documents. Furthermore, even larger volumes of information may be associated with a collection—typically audio, image, and video, with textual metadata. Once initiated, the mechanical process of building the collection may take from a few moments for a tiny collection to several hours for a multi-Gbyte one—perhaps even a day if it involves many different full-text indexes

Research Commons@Waikato

Creating and customizing digital library collections with the Greenstone Librarian Interface

Author: Witten Ian H.
Publication venue: 'Institute of Mathematics, University of Tsukuba'
Publication date: 01/01/2004

The Greenstone digital library software is a comprehensive system for building and distributing digital library collections. It provides a new way of organizing information and publishing it on the Internet. This paper describes how digital library collections can be created and customized with the new Greenstone Librarian Interface. Its basic features allow users to add documents and metadata to collections, create new collections whose structure mirrors existing ones, and build collections and put them in place for users to view. More advanced users can design and customize new collection structures. At the most advanced level, the Librarian Interface gives expert users interactive access to the full power of Greenstone, which could formerly be tapped only by running Perl scripts manually.

Research Commons@Waikato

Classification

Author: Witten Ian H.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2009

In classification learning, an algorithm is presented with a set of classified examples or “instances” from which it is expected to infer a way of classifying unseen instances into one of several “classes”. Instances have a set of features or “attributes” whose values define that particular instance. Numeric prediction, or “regression,” is a variant of classification learning in which the class attribute is numeric rather than categorical. Classification learning is sometimes called supervised because the method operates under supervision by being provided with the actual outcome for each of the training instances. This contrasts with Data clustering (see entry Data Clustering), where the classes are not given, and with Association learning (see entry Association Learning), which seeks any association – not just one that predicts the class.
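
The setting described above can be sketched with a very small classifier. The entry does not name a particular algorithm, so nearest-neighbour classification is used here purely as an illustrative choice; the function name, attribute values, and class labels are hypothetical.

```python
from math import dist
from typing import List, Sequence, Tuple

Instance = Sequence[float]                 # attribute values of one instance
Example = Tuple[Instance, str]             # (attributes, class) training pair


def classify(training: List[Example], unseen: Instance) -> str:
    """Return the class of the training instance closest to the unseen one."""
    # Supervision: every training instance carries its actual class label.
    _, label = min(training, key=lambda example: dist(example[0], unseen))
    return label


# Hypothetical training instances with two numeric attributes each.
TRAINING = [
    ((1.0, 1.0), "small"),
    ((1.2, 0.8), "small"),
    ((5.0, 5.5), "large"),
    ((6.0, 5.0), "large"),
]

print(classify(TRAINING, (0.9, 1.1)))
print(classify(TRAINING, (5.5, 5.2)))
```

If the class labels were numbers and the prediction were, say, the label of the nearest instance, the same code would be doing numeric prediction rather than categorical classification.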

Research Commons@Waikato

The Development and Usage of the Greenstone Digital Library Software

Author: Witten Ian H.
Publication venue: ASIS&T
Publication date: 01/01/2008

The Greenstone software has helped spread the practical impact of digital library technology throughout the world, particularly in developing countries. This article reviews the project’s origins, usage, and the development of support mechanisms for Greenstone users. We begin with a brief summary of salient aspects of this open source software package and its user population. Next we describe how its international, humanitarian focus arose. We then review the special requirements imposed by the conditions that prevail in developing countries. Finally we discuss efforts to establish regional support organizations for Greenstone in India and Africa.

Research Commons@Waikato

Categories of holomorphic line bundles on higher dimensional noncommutative complex tori

Author: Hiroshige Kajiura
Li H.
Witten E.
Publication venue: 'AIP Publishing'
Publication date: 01/01/2005

We construct explicitly noncommutative deformations of categories of holomorphic line bundles over higher dimensional tori. Our basic tools are Heisenberg modules over noncommutative tori and complex/holomorphic structures on them introduced by A. Schwarz. We obtain differential graded (DG) categories as full subcategories of curved DG categories of Heisenberg modules over the complex noncommutative tori. Also, we present the explicit composition formula of morphisms, which in fact depends on the noncommutativity. Comment: 28 pages

arXiv.org e-Print Archive

CiteSeerX

Crossref

Kyoto University Research Information Repository

CERN Document Server

Thesaurus-based index term extraction for agricultural documents

Author: Medelyan Olena
Witten Ian H.
Publication venue: EFITA/WICCA
Publication date: 01/01/2005

This paper describes a new algorithm for automatically extracting index terms from documents relating to the domain of agriculture. The domain-specific Agrovoc thesaurus developed by the FAO is used both as a controlled vocabulary and as a knowledge base for semantic matching. The automatically assigned terms are evaluated against a manually indexed 200-item sample of the FAO’s document repository, and the performance of the new algorithm is compared with a state-of-the-art system for keyphrase extraction.
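
The role of a thesaurus as a controlled vocabulary can be sketched as follows. This is only a toy illustration of restricting candidate index terms to thesaurus entries, not the algorithm the paper describes: the thesaurus entries below are hypothetical and are not taken from Agrovoc, and ranking by raw frequency is an assumption made for the example.

```python
import re
from collections import Counter
from typing import Dict

# Hypothetical toy thesaurus: surface phrase -> preferred descriptor.
THESAURUS: Dict[str, str] = {
    "maize": "maize",
    "corn": "maize",            # non-preferred synonym mapped to its descriptor
    "soil erosion": "soil erosion",
    "irrigation": "irrigation",
}


def index_terms(text: str, thesaurus: Dict[str, str], max_words: int = 3) -> Counter:
    """Count thesaurus descriptors matched by word n-grams in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts: Counter = Counter()
    for n in range(1, max_words + 1):
        for start in range(len(words) - n + 1):
            phrase = " ".join(words[start:start + n])
            if phrase in thesaurus:
                # Only phrases found in the controlled vocabulary become candidates.
                counts[thesaurus[phrase]] += 1
    return counts


TEXT = "Corn yields fell where soil erosion was severe, so maize growers changed tillage."
print(index_terms(TEXT, THESAURUS).most_common())
```

Mapping synonyms to one descriptor is what lets "corn" and "maize" count toward the same index term, which free-text keyphrase extraction would treat as two different phrases.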

CiteSeerX

Research Commons@Waikato

Strong-Electroweak Unification at About 4 TeV

Author: Paul H. Frampton
Witten E.
Publication venue: 'World Scientific Pub Co Pte Lt'
Publication date: 05/08/2002

It is shown how an $SU(3)^{N}$ nonsupersymmetric quiver gauge theory can accommodate the standard model with three chiral families and unify all of $SU(3)_C$, $SU(2)_L$ and $U(1)_Y$ couplings with high accuracy at one unique scale estimated as $M \simeq 4$ TeV. Comment: 3 pages LaTeX. Typos corrected. Text and references added

arXiv.org e-Print Archive

Crossref

Carolina Digital Repository

CERN Document Server

Topological Charge Membranes in 2D and 4D Gauge Theory

Author: H. Thacker
Hasenfratz
Horváth
J. Lenaghan
Luscher
S. Ahmad
Witten
Publication venue: 'Elsevier BV'
Publication date: 01/01/2004

Local topological charge structure in the 2D CP(N-1) sigma models is studied using the overlap Dirac operator. Long-range coherence of topological charge along locally 1D regions in 2D space-time is observed. We discuss the connection between these results and the recent discovery of coherent 3D sheets of topological charge in 4D QCD. In both cases, coherent regions of topological charge form along surfaces of approximate codimension 1. Comment: Lattice2004(topology)

arXiv.org e-Print Archive

CiteSeerX

Crossref

CERN Document Server